Metadata extraction and text categorization using Universal Resource Locator expansions

Authors

  • Min-Yen Kan
Abstract

Uniform resource locators (URLs), which mark the address of a resource on the World Wide Web, are often human-readable and can indicate metadata about a resource. This paper explores the mining of URLs to yield categoric metadata about web resources via a three-phase pipeline of word segmentation, abbreviation expansion and classification. I apply this approach to the problem of subject metadata generation and quantify its performance relative to title- and document-based methods, both of which require the retrieval of the source document.
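
The first two phases named in the abstract (word segmentation and abbreviation expansion) can be illustrated with a minimal sketch. This is not the paper's implementation: the vocabulary `VOCAB`, the lookup table `ABBREVIATIONS`, the fewest-words segmentation rule, and all function names are hypothetical choices made for illustration. The third phase, classification, is omitted; the resulting word list would simply be handed to a text classifier of one's choosing.

```python
import re
from urllib.parse import urlparse

# Hypothetical toy resources; a real system would need far larger ones.
VOCAB = {"computer", "science", "department", "news", "sports", "home", "page"}
ABBREVIATIONS = {"cs": "computer science", "dept": "department"}


def url_tokens(url):
    """Split host and path into lowercase alphabetic chunks."""
    parsed = urlparse(url)
    return re.findall(r"[a-z]+", (parsed.netloc + parsed.path).lower())


def segment(chunk, vocab=VOCAB):
    """Split a chunk into vocabulary words, preferring the fewest pieces.

    If the chunk cannot be fully covered by vocabulary words,
    it is returned unchanged as a single token.
    """
    best = {0: []}  # best[i] holds the fewest-word split of chunk[:i]
    for end in range(1, len(chunk) + 1):
        for start in range(end):
            if start in best and chunk[start:end] in vocab:
                candidate = best[start] + [chunk[start:end]]
                if end not in best or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best.get(len(chunk), [chunk])


def expand(word, table=ABBREVIATIONS):
    """Replace a known abbreviation with its expansion, split into words."""
    return table.get(word, word).split()


def url_features(url):
    """Run tokenization, segmentation, and expansion over one URL."""
    words = []
    for chunk in url_tokens(url):
        for piece in segment(chunk):
            words.extend(expand(piece))
    return words
```

With these toy tables, `url_features("http://www.example.edu/cs/homepage.html")` splits `homepage` into `home` and `page` and expands `cs` to `computer science`, all without retrieving the document itself.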

Similar resources

Joint Web-Feature (JFEAT): A Novel Web Page Classification Framework

With the increasing amount of web pages over the internet, it has been a major concern to obtain information on the internet accurately at a reasonable cost with decent performance. A potential solution is through the classification of web pages into meaningful categories. An effective classification of web pages is of benefit to various applications such as web mining and search engines. Unlik...

Categorizing Learning Objects Based On Wikipedia as Substitute Corpus

As metadata is often not sufficiently provided by authors of Learning Resources, automatic metadata generation methods are used to create metadata afterwards. One kind of metadata is categorization, particularly the partition of Learning Resources into distinct subject categories. A disadvantage of state-of-the-art categorization methods is that they require corpora of sample Learning Resources...

Guest Editorial on Metadata

This is a special issue on the topic of metadata. An often-cited definition of metadata is `data about data'. In most cases, this means data that describe documents, for example, the author of a document, the date that a photograph was taken, or the Universal Resource Locator (URL) of a Web site. The World-Wide Web Consortium defines metadata as `machine understandable information for the Web' <h...

Metadata for electronic information resources: From variety to interoperability

Metadata serves several purposes. It supports resource discovery, locates the actual digital resource by inclusion of a digital identifier, organizes electronic resources bringing similar resources together and distinguishing dissimilar resources, provides administrative information for controlling the digital library, and provides technical, preservation and rights management information neede...

Publication date: 2003